CSDA Lab, Mathematics and Statistics Department, University of West Florida
High-dimensional data are data with a large number of features (covariates) \(p\). Formally, for a data matrix \(\mathbf{X} \in \mathbb{R}^{n\times p}\):
\[ p \gg n, \tag{1} \] where \(n\) is the number of observations.
In this context, many challenges arise:
Some solutions in the literature:
Our data are mass spectrum signals (functional data).
The Fourier Transform of a signal \(x(t)\) can be expressed as:
\[ X(f)= \int_{-\infty}^{\infty} x(t) e^{-i2 \pi ft} dt, \tag{2} \] where \(f\) is the frequency and \(e^{ix}= \cos x + i \sin x\) (Euler’s formula).
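For a sampled signal, the integral in Eq. (2) is approximated by the discrete Fourier transform. The following is a minimal sketch, assuming a uniformly sampled signal; the sampling rate `fs`, the 50 Hz test tone, and all variable names are illustrative and do not come from the mass spectrum data.

```python
import numpy as np

fs = 1000.0                        # assumed sampling rate in Hz (illustrative)
n = 1000                           # number of samples (1 second of signal)
t = np.arange(n) / fs              # uniform time grid
x = np.sin(2 * np.pi * 50.0 * t)   # hypothetical test signal: a 50 Hz tone

# NumPy's forward transform uses the kernel exp(-2j*pi*k*m/n),
# the discrete counterpart of the Fourier kernel.
X = np.fft.rfft(x)
freqs = np.fft.rfftfreq(n, d=1.0 / fs)

# The magnitude spectrum peaks at the frequency of the tone.
peak_freq = freqs[np.argmax(np.abs(X))]
print(peak_freq)
```

The magnitude spectrum tells us which frequencies are present, but not where in the signal they occur.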
The Wavelet Transform of a signal \(x(t)\) can be given as:
\[ WT(s,\tau)= \frac{1}{\sqrt s}\int_{-\infty}^{\infty} x(t) \psi^*\big(\frac{t-\tau}{s}\big) dt, \tag{3} \]
where \(\psi^*(t)\) denotes the complex conjugate of the base wavelet \(\psi(t)\), \(s\) is the scaling parameter, and \(\tau\) is the location parameter.
Example: Morlet Wavelet \(\psi(t) = e^{i2 \pi f_0t} e^{-(\alpha t^2/\beta^2)}\), with the parameters \(f_0\), \(\alpha\), \(\beta\) all being constants.
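Eq. (3) with the Morlet wavelet can be evaluated numerically by replacing the integral with a Riemann sum. The sketch below is only an illustration under stated assumptions: the parameter values \(f_0 = \alpha = \beta = 1\), the 2 Hz test tone, and the function names `morlet` and `wavelet_transform` are hypothetical choices, and a dedicated wavelet library would normally be preferred for real data.

```python
import numpy as np

def morlet(t, f0=1.0, alpha=1.0, beta=1.0):
    """Morlet base wavelet: exp(i*2*pi*f0*t) * exp(-alpha*t^2/beta^2)."""
    return np.exp(1j * 2 * np.pi * f0 * t) * np.exp(-alpha * t**2 / beta**2)

def wavelet_transform(x, t, scales, taus, f0=1.0, alpha=1.0, beta=1.0):
    """Riemann-sum approximation of WT(s, tau) on a uniform time grid."""
    dt = t[1] - t[0]
    out = np.empty((len(scales), len(taus)), dtype=complex)
    for i, s in enumerate(scales):
        for j, tau in enumerate(taus):
            psi = morlet((t - tau) / s, f0, alpha, beta)
            out[i, j] = np.sum(x * np.conj(psi)) * dt / np.sqrt(s)
    return out

# Hypothetical test signal: a 2 Hz cosine on [-10, 10].
t = np.linspace(-10.0, 10.0, 4001)
x = np.cos(2 * np.pi * 2.0 * t)

scales = np.array([0.25, 0.5, 1.0])   # illustrative scales
taus = np.array([0.0])                # a single location
W = wavelet_transform(x, t, scales, taus)

# With f0 = 1, the wavelet at scale s oscillates at f0/s, so the
# response to a 2 Hz tone is largest at s = 0.5.
best_scale = scales[np.argmax(np.abs(W[:, 0]))]
print(best_scale)
```

Varying \(\tau\) as well as \(s\) gives a time-scale map of the signal, which is the localization that the Fourier transform lacks.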
A workflow for ML is the following:
Data Collection
Data Processing: Clean, Explore, Prepare, Transform
Modeling: Develop, Train, Validate, and Evaluate
Deployment: Deploy, Monitor and Update
Go to 1.
We designed a statistical experiment to evaluate four different processing approaches.
Variables of the experimental design:
Four pre-processing techniques.
Two are wavelet-based and two are not.
Five window sizes.
Ten wavelet families.
Four ML models: Logistic Regression, Support Vector Machine, Random Forest, and XGBoost.
Two sampling schemes: over-sampling and under-sampling, to address the class imbalance.
Each case was repeated 100 times.
A total of 88,000 models were run.
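The total number of models follows from the design factors. The short check below assumes that the wavelet family is crossed only with the two wavelet-based techniques, which is the reading that reproduces the stated total; the variable names are illustrative.

```python
n_wavelets = 10    # wavelet families
n_windows = 5      # window sizes
n_models = 4       # ML models
n_sampling = 2     # over-sampling and under-sampling
n_repeats = 100    # repetitions per case

# Assumption: the two wavelet-based techniques are crossed with the wavelet family.
wavelet_runs = 2 * n_wavelets * n_windows * n_models * n_sampling * n_repeats

# The two non-wavelet techniques have no wavelet-family factor.
plain_runs = 2 * n_windows * n_models * n_sampling * n_repeats

total_runs = wavelet_runs + plain_runs
print(wavelet_runs, plain_runs, total_runs)  # 80000 8000 88000
```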
Processing 1 (PROC1): The feature space includes the mean, variance, energy, coefficient of variation, skewness, and kurtosis, with the wavelet transform.
Processing 2 (PROC2): Same as PROC1, but the feature space also includes the first 10 autocorrelation coefficients.
Processing 3 (PROC3): Same as PROC1 but without the wavelet transform.
Processing 4 (PROC4): Same as PROC2 but without the wavelet transform.
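The window-level features of the non-wavelet variants (PROC3 and PROC4) can be sketched as follows. This is an illustration under assumptions not stated above: non-overlapping windows, population (not sample) moments, non-excess kurtosis, and concatenation of the per-window features into one vector. The function names and the synthetic signal are hypothetical. For PROC1 and PROC2 the wavelet transform would enter the pipeline as well; how it is combined with these features is not specified here, so it is left out.

```python
import numpy as np

def summary_features(w):
    """Mean, variance, energy, coefficient of variation, skewness, kurtosis."""
    w = np.asarray(w, dtype=float)
    mean = w.mean()
    var = w.var()                  # population variance (ddof=0), an assumption
    std = np.sqrt(var)
    energy = np.sum(w**2)
    cv = std / mean                # undefined if the window mean is zero
    z = (w - mean) / std
    skew = np.mean(z**3)
    kurt = np.mean(z**4)           # non-excess kurtosis (about 3 for Gaussian data)
    return np.array([mean, var, energy, cv, skew, kurt])

def autocorr_features(w, n_lags=10):
    """Sample autocorrelation coefficients at lags 1..n_lags."""
    d = np.asarray(w, dtype=float)
    d = d - d.mean()
    denom = np.sum(d**2)
    return np.array([np.sum(d[k:] * d[:-k]) / denom for k in range(1, n_lags + 1)])

def window_features(signal, window_size, with_acf=False):
    """Concatenate features over non-overlapping windows of the signal."""
    signal = np.asarray(signal, dtype=float)
    rows = []
    for i in range(signal.size // window_size):
        w = signal[i * window_size:(i + 1) * window_size]
        f = summary_features(w)
        if with_acf:
            f = np.concatenate([f, autocorr_features(w)])
        rows.append(f)
    return np.concatenate(rows)

# Synthetic stand-in for one spectrum (not real data).
rng = np.random.default_rng(0)
signal = rng.random(256) + 1.0

proc3 = window_features(signal, window_size=64)                  # 4 windows x 6 features
proc4 = window_features(signal, window_size=64, with_acf=True)   # 4 windows x 16 features
print(proc3.shape, proc4.shape)
```

Summarizing each window by a handful of statistics reduces the dimension from tens of thousands of m/z values to a few features per window, which is the point of the processing step when \(p \gg n\).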
The performance metrics utilized were:
Observed 32,768 m/z values / 33,885 m/z values
Link: https://bioinformatics.mdanderson.org/public-datasets/
Joint Mathematics Meetings | Jan 8-11, 2025 | Seattle